The Plan

  1. Make sure you have all tools installed. mafft, trimal, iqtree at the very least.
    • use conda env create -f ./conda_environment.yaml to create this environment and re-open jupyter inside the environment if you haven't already.
  2. Get a set of sequences to build a tree of.
    • for example from the 1kP project, a paper you found, or from blast.
    • Subset sequences if there are too many.
    • Do you have an outgroup to root your tree on? (unless you won't root your tree)
    • Do you have trusted or verified sequences to make sense of the different clades you may get in your tree?
  3. Align these to each other.
    • using mafft or another aligner like clustalw, tcoffee or prank.
  4. Trim the alignment, removing gaps.
    • with trimAL, optimise trimming to both make your tree building faster and more reliable.
  5. Build a fast phylogentic tree.
    • with fasttree, or with iqtree --fast
  6. Build a thorough phylogenetic tree.
    • We combine substitution model fitting and tree building in IQtree.
  7. Visualise and share your tree

Annotate and log

A jupy notebook like this, is your labjournal for doing research on the computer. In here you keep a record of

  • what files you use
  • how you made new files
  • where you stored these
  • etcetera.

You do this just by writing the code and keeping the output saved in here. However, one thing is not kept automatically, and that is the choices you make. Hence for transparent and propper science, it is vital that you make this notebook your own, by writing all observations, desicions etcetera in here.

image.png

Describe how you got the sequences you're making a tree of, why you got those sequences there, and what question you are trying to answer by making this tree.

Compared to the Basal v2 workflow, this is changing:

We suspect that the variety of M only, and MIKCc and MIKC sequences is messing up our tree inference. Hence this workflow aims to infer a tree of only MIKCc and MIKC sequences. That should also be the case for the root of the tree, algal sequences. In a separate notebook I collected algal sequences which at sight seem to have at least the MIK domains.

I'm choosing here to remove short sequences after aligning for the literature review I would need to do up front, or working with lengh only, is not satisfactory.

Extra outgroup sequences:https://www.uniprot.org/uniprot/C4IGU9 https://www.uniprot.org/uniprot/C4IGU6

And the paper where I got these: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC548990/

Just FYI, a screenshot of the domains from the basalv2 workflow:

image.png

1. Composing your fasta

Store the sequences you want to make a tree of in the data directory and make the inseq variable the name of your input fasta without the extention:

In [2]:
inseq=MIKC_orthogroup

Let's look at the first ten lines of your fasta to double-check.

In [3]:
head data/$inseq.fasta
>DUQG-2010341-Tanacetum_parthenium
MGRGKIEIKRIENTSNRQVTYSKRKNGIIKKAKEITVLCDSNVSLVIYGSSGKMYEYCSPKTNLIDMLDRYQRLSGNKLWDAKHENLQNEIDRIKKENESMQIELRHLKGEDITSLNYEELIAYEDALENGLTNIREKKDEIPKIMRKHEQDLE
>NVGZ-2012083-NVGZ-Cephalotaxus_harringtonia-2_samples_combined
QKEKGSQLWDTEHQNLYNEIKRLKEENEKLKSNLRHMKGEDINSLRVEELCLLEQALEIAIDRVRTKKDQIFMQELYSSRKRLSSLEEENKRLRGIAGMRGAIIMDQQQQHGYGHVYAEGEGGTCMHLGPAFRVQPSHPNLKD
>XLGK-2010739-Podocarpus_rubens
MGRGRVELKRIENKINRQVTFSKRRNGLLKKAYELAVLCDAEVAVIIFSSRGKVYEYGSTGSTLKTLERYQKCTYVLQESTVTPDREAQNWNQEVTKLKAKLEFLQQTQRRLLGEELGPLSIKELQQLESQLEIGVKNVRTRREQSTLETIDDLKKRTQQLIETNKALRKQCAMQGQGFASLQAAPHSWDSNAVPNVAYALPPNQSNPVDCEPTLHIGYHYGHPETSMPRQEQTPNGYMQGWML
>Carpa_ASGPBv0_4-evm_model_supercontig_26_316-Carica_papaya
MGRGKIEIKRIENLSNRQVTYSKRRNGIIKKAKEITVLCDARVSLIIFASSGKMHEYCSPSTSLTNMLDEYHKAGKPRLWDAKHENLNNEIERVKKENDNMQIELRHLRGEDITSLNHRDLMALEETLETGLASVRNKQMEVLKMMRRNEKILEEENRRLSFALQQQEIAIENSAREMENGYQQRMREYNAHMPFAFRVQPIQPNLQDRI
>NFXV-2017621-Mumea_americana
MAREKIQIRKIDNATARQVTFSKRRKGLFKKAEELSVLCDADVALIIFSSTGKLFEYSSSSMKEILERHHLHSKNLEKLEQPSLELQLVENGNCSRLSKEVAEKSHQLRKLRGEELQGLCLEELLQLEKSLETGLSRVIENKEGKIMKEINELQRKGKELMEENELLKQQVVEMTNGQRQVPTDSDNIIYEEGQSSESITNACNSSGPPNDYESSDTSLKLGLPYS

Check the nr. of sequences in your fasta file:

In [4]:
grep '>' data/$inseq.fasta -c
16003

I'm taking the sequences of selection basalv2 and the selection-agal

In [9]:
cat data/MIKC_orthogroup_selection-algae.fasta data/MIKC_orthogroup_selection-basal-v2.fasta > data/MIKC_orthogroup_selection-basal-v3.fasta

1.3 add guide sequences

If you have versions of your sequence of special interest, or functionally verified ones, be sure to add them!

I imagine you have your guide sequences named something like data/guide_sequences_v1.fasta Combine the two files like so and update the $inseq variable with the new name if you are done in this section.

In [10]:
# for the selection workflow
inseq=MIKC_orthogroup
cat data/outgroup_algae.fasta \
    data/Zhang-2019-fig4_MIKC_Azfi-gymnosperms_selection-v3.faa \
    data/Azfi-v1_MIKC_manual.faa \
    data/azolla_filiculoides/filiculoides_fernbase_blast_results.fasta  \
    data/piceaabies_sequences/picea-abies_MADs_blast_results_uniq.fasta \
    data/salvinia_sequences/salivinia_fernbase_blast_results_uniq.fasta  \
    data/Henriette_guideazfiv2.fasta \
    data/"$inseq"_selection-basal-v3.fasta \
    > data/"$inseq"_selection-basal-v3_guide-v4.fasta
In [12]:
grep '>' data/"$inseq"_selection-basal-v3_guide-v4.fasta -c
grep '>' data/"$inseq"_selection-basal-v3_guide-v4.fasta
425
>tr|Q5KU24|Q5KU24_COLSC MADS-box transcription factor CsMADS1 OS=Coleochaete scutata OX=3125 GN=csmads1 PE=2 SV=1
>tr|Q5KTX4|Q5KTX4_9VIRI MADS-box protein CpMADS1 (Fragment) OS=Closterium peracerosum-strigosum-littorale complex OX=34146 GN=CpMADS1 PE=2 SV=1
>cgMADS1_BAD88436
>CRM1_CAA69276
>CRM2_CAA69277
>CRM3_CAA69407
>CRM4_CAA69408
>CRM5_CAA69409
>CRM6_CAA69410
>CRM7_CAA69411
>OPM1_CAA69412
>CMADS1_AAC24492
>CMADS2_AAC24493
>CMADS3_AAC24319
>CMADS4_AAC24320
>CMADS6_AAC24325
>SmMADS6
>SmMADS4
>SmMADS3
>SmMADS2
>SmMADS10
>SmMADS1
>PPM1_Pp012G078200
>PPM2_Pp003G125000
>PpMADS1_Pp004G002000
>PpMADS_S
>PPMC5_Pp017G019900
>PPMC6_Pp014G056100
>PpMADS2_Pp012G080000
>PpMADS3_Pp003G124300
>PPM3_Pp014G086000
>PPM4_Pp001G158500
>PPM6_Pp008G025800
>PPM7_Pp017G043700
>PPMA8_Pp008G025900
>PPMA9_Pp003G124400
>PPMA11_Pp001G158400
>PPMA12_Pp004G013900
>Pp014G086100
>PI
>AP3
>AGL2_SEP1
>AGL6
>AGL8_FUL
>AGL25_FLC
>AG
>AGL12
>AGL15
>AGL17
>AGL20_SOC1
>AGL22_SVP
>AGL32_ABS
>AGL104
>AGL67
>AGL66
>AGL30
>AGL94
>AGL33
>AGL65
>OsMADS2
>OsMADS16
>OsMADS55
>OsMADS30
>OsMADS6
>OsMADS8
>OsMADS15
>OsMADS23
>OsMADS26
>OsMADS3
>OsMADS50
>OsMADS62
>OsMADS63
>OsMADS68
>AtSTK_Homeotic_D
>AtAP1_Homeotic_A
>OsMADS14_Homeotic_A
>MpMADS1
>MpMADS2
>Azfi_s0028.g024032_manually_reannotated
>Azfi_s0062.g035272 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0028.g024032 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0003.g007710 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0003.g007732 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0704.g082655 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0242.g059800 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0078.g038112 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0038.g026346 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0016.g014286 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0001.g000595 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0034.g025354 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0009.g011788 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0073.g037211 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0684.g082058 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0009.g011751 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0009.g011735 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0009.g011792 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0009.g011725 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0009.g011722 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0038.g026344 sequence match in blast db Azolla filiculoides protein v1.1
>Azfi_s0016.g014288 sequence match in blast db Azolla filiculoides protein v1.1
>MA_101463g0010
>MA_10256834g0010
>MA_104138g0010
>MA_10434339g0010
>MA_113509g0010
>MA_130755g0010
>MA_13933g0010
>MA_17919g0010
>MA_19007g0010
>MA_19007g0020
>MA_211156g0010
>MA_254338g0010
>MA_276701g0010
>MA_333471g0010
>MA_35712g0010
>MA_458668g0010
>MA_502016g0010
>MA_57493g0010
>MA_629987g0010
>MA_6805g0010
>MA_758606g0010
>MA_78010g0010
>MA_9284799g0010
>MA_9310518g0010
>MA_95674g0010
>Sacu_v1.1_s0003.g001660 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0003.g001705 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0004.g002164 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0004.g002217 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0006.g003070 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0006.g003425 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0006.g003429 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0007.g003739 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0008.g004031 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0014.g006239 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0027.g009703 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0027.g009704 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0033.g011014 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0043.g012907 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0043.g012908 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0050.g013783 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0087.g018504 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0115.g020974 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0129.g021995 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0207.g025791 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0239.g026596 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0239.g026597 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0271.g027022 MADS-box transcription factor [0.179]
>Sacu_v1.1_s0932.g027847 MADS-box transcription factor [0.179]
>Afi_v2_s1G000300.1
>Afi_v2_s1G000320.1
>Afi_v2_s1G000340.1
>Afi_v2_s1G000630.1
>Afi_v2_s1G000640.1
>Afi_v2_s2G003330.1
>Afi_v2_s6G005990.1
>Afi_v2_s6G006070.6
>Afi_v2_s19G007500.1
>Afi_v2_s20G002360.1
>Afi_v2_s25G003350.1
>Afi_v2_s26G003990.1
>Afi_v2_s28G002220.1
>Afi_v2_s37G001110.1
>Afi_v2_s46G001100.2
>Afi_v2_s70G000030.1
>Afi_v2_s132G000050.1
>Afi_v2_s1737G000020.1
>Afi_v2_s3080G000010.1
>Afi_v2_s3081G000010.1
>QXSZ-2000951-Mantoniella_squamata
>IRZA-2060948-Proteomonas_sulcata
>MXDS-2024758-Spermatozopsis_exsultans
>VBLH-2005464-Cladophora_glomerata
>AZZW-2004234-Chlorokybus_atmophyticus
>DRGY-2036441-Chaetosphaeridium_globosum
>IJMT-2090153-Aphanochaete_repens
>MNNM-2042156-Cosmarium_granatum
>YDCQ-2002995-Cephaleuros_virescens
>OBUY-2002084-Porphyridium_cruentum
>AKCR-2016810-Parachlorella_kessleri
>IHJY-2008922-Kappaphycus_alvarezii
>QFND-2006855-Cyanophora_paradoxa-CCAC_0074
>LBRP-2057637-Chloromonas_reticulata-A
>JEBK-2024384-Eucheuma_denticulatum
>PUAN-2001825-Pedinomonas_tuberculata
>BAJW-2043867-Isochrysis_sp.
>WDGV-2005229-Cosmarium_subtumidum
>JTIG-2001880-Bryopsis_plumosa
>MMKU-2003215-Nephroselmis_olivace
>BAJW-2009533-Isochrysis_sp.
>BCYF-2040514-Chlamydomonas_cribrum
>IJMT-2016960-Aphanochaete_repens
>AJAU-2006740-Helicodictyon_planctonicum
>VFIV-2046147-Fritschiella_tuberosa
>RSOF-2042181-Glaucosphaera_vacuolata
>IHJY-2008923-Kappaphycus_alvarezii
>ACRY-2040985-Pteromonas_sp.
>UTRE-2013475-Chloromonas_tughillensi
>HJVM-2025949-Cosmarium_ochthodes
>VBLH-2012088-Cladophora_glomerata
>ZRMT-2008399-Mougeotia_sp.
>UTRE-2013474-Chloromonas_tughillensi
>VNAL-2009437-Gracilaria_asiatica
>BAJW-2010911-Isochrysis_sp.
>FPCO-2005600-Interfilum_paradoxum
>ALZF-2019222-Halochlorococcum_marinum
>YTYU-2001928-Cyanophora_paradoxa-CCAC_0091
>JOJQ-2006430-Cylindrocystis_cushleckae
>QWFV-2006406-Bambusina_borreri
>JKHA-2005702-Cyanoptyche_gloeocystis
>ROZZ-2057063-Chroomonas_sp.
>KYIO-2005290-Mesostigma_viride
>RUIF-2012753-Carteria_obtusa
>MWXT-2008060-Chara_vulgaris
>BFIK-2027273-Entransia_fimbriat
>BWVJ-2023841-Betaphycus_gelatinae
>PQED-2054148-Gloeochaete_wittrockiana
>LLEN-2041431-Pavlova_lutheri-SAG_926_1
>GFUR-2015336-Chloromonas_subdivisa
>EEJO-2008116-Neochloris_oleoabundans
>ISPU-2046449-Volvox_globator
>PFUD-2035957-Geminella_sp.
>RPRU-2012687-Staurodesmus_omearii
>UYFR-2005123-Symphyocladia_latiuscula
>CKXF-2001315-Gymnogongrus_ftabelliformis
>XDLL-2011037-Oogamochlamys_gigantea
>IAYV-2002669-Rhodomonas_sp.
>TPHT-2126301-Spirotaenia_sp.
>UYFR-2005122-Symphyocladia_latiuscula
>VZWX-2001427-Ceramium_kondoi
>IRZA-2055933-Proteomonas_sulcata
>USIX-2000378-Neochlorosarcina_sp.
>YDCQ-2002994-Cephaleuros_virescens
>NKXU-2026356-Trebouxia_arboricola
>RFAD-2024388-Pavlova_lutheri-CCAC_0028
>UTRE-2013473-Chloromonas_tughillensi
>FQLP-2004170-Klebsormidium_subtile
>IKWM-2001212-Gracilaria_lemaneiformi
>YSQT-2007847-Penium_exiguum
>RFAD-2026715-Pavlova_lutheri-CCAC_0028
>STKJ-2019062-Zygnema_sp.-A
>TPHT-2010795-Spirotaenia_sp.
>IAYV-2046290-Rhodomonas_sp.
>BAKF-2055440-Cryptomonas_curvata
>NBYP-2056123-Mesotaenium_kramstei
>DUMA-2035243-Tetraselmis_cordiformis
>NMAK-2036680-Pavlova_lutheri-UTEX_LB_1293
>JRGZ-2005826-Chlamydomonas_moewusii
>MJMQ-2038263-Hemiselmis_virescens
>ISIM-2004623-Nephroselmis_pyriformis
>JKHA-2002021-Cyanoptyche_gloeocystis
>HYHN-2013384-Prasinoderma_coloniale
>KUJU-2044621-Gonium_pectorale
>XTON-2045004-Pediastrum_duplex-CCAC_0147
>VIAU-2063194-Carteria_crucifera
>QPDY-2028842-Coleochaete_irregularis
>DZPJ-2049882-Cylindrocapsa_geminella
>TNAW-2072860-Pyramimonas_parkeae
>BAJW-2033085-Isochrysis_sp.
>XAXW-2005841-Polysiphonia_japonica
>ISGT-2008818-Hormidiella_sp.
>ENAU-2017818-Spermatozopsis_similis
>RTLC-2024115-Rhodella_violacea
>RNAT-2007499-Eudorina_elegans
>IAYV-2002668-Rhodomonas_sp.
>ZIVZ-2044157-Phacotus_lenticularis
>VQBJ-2041758-Coleochaete_scutata
>LLEN-2000140-Pavlova_lutheri-SAG_926_1
>EGNB-2027623-Scourfieldia_sp.
>JMUI-2003983-Stigeoclonium_helveticum
>AJUW-2052652-Chloromonas_rosa
>JTIG-2032291-Bryopsis_plumosa
>BAKF-2006321-Cryptomonas_curvata
>OFUE-2041196-Lobochlamys_segnis
>XIVI-2096399-Cymbomonas_sp.
>JKHA-2005703-Cyanoptyche_gloeocystis
>JWGT-2005549-Volvox_aureus-M1028
>LLEN-2041277-Pavlova_lutheri-SAG_926_1
>HVNO-2016659-Tetraselmis_chui
>JOJQ-2006431-Cylindrocystis_cushleckae
>IAYV-2007982-Rhodomonas_sp.
>MOYY-2018826-Pleurotaenium_trabecul
>IJMT-2002615-Aphanochaete_repens
>PVGP-2004381-Porphyridium_purpureum
>JJZR-2002753-Rhodochaete_parvula
>RYJX-2009960-Pandorina_morum
>ISHC-2045472-Staurastrum_sebaldi
>RUIF-2054074-Carteria_obtusa
>RAWF-2043643-Uronema_belka
>TSBQ-2057925-Chlamydomonas_sp.-M2762
>URSB-2014843-Grateloupia_turuturu
>VQBJ-2001736-Coleochaete_scutata
>PVGP-2020165-Porphyridium_purpureum
>XOAL-2033148-Dolichomastix_tenuilepi
>PIVW-2001489-Ceratopteris_thalictroides
>IRBN-2000158-Scapania_nemorosa
>FFPD-2048003-Ceratodon_purpureus
>CBAE-2050356-Huperzia_myrisinites
>ENQF-2084602-Lycopodium_annotinum
>IRBN-2163827-Scapania_nemorosa
>PIVW-2016418-Ceratopteris_thalictroides
>IHJY-2008922-Kappaphycus_alvarezii
>QFND-2006855-Cyanophora_paradoxa-CCAC_0074
>FFPD-2013099-Ceratodon_purpureus
>JVSZ-2007618-Equisetum_hymale
>JVSZ-2012388-Equisetum_hymale
>JVSZ-2127794-Equisetum_hymale
>JVSZ-2117507-Equisetum_hymale
>AKXB-2066716-Phaeomegaceros_coriaceus
>QMWB-2049930-Anomodon_attenuatus
>IHJY-2008923-Kappaphycus_alvarezii
>JVSZ-2004872-Equisetum_hymale
>LGOW-2015978-Schistochila_sp.
>ANON-2035464-Leiosporoceros_dussii-B
>ZFRE-2001130-Phaeoceros_carolinianus-sporophyte_1575
>RDOO-2005177-Racomitrium_varium
>CVEG-2124355-Azolla_cf._caroliniana
>JVSZ-2007619-Equisetum_hymale
>PIVW-2002304-Ceratopteris_thalictroides
>PIVW-2019843-Ceratopteris_thalictroides
>YTYU-2001928-Cyanophora_paradoxa-CCAC_0091
>BNCU-2083596-Radula_lindenbergia
>RDOO-2000123-Racomitrium_varium
>JVSZ-2114254-Equisetum_hymale
>RDOO-2000812-Racomitrium_varium
>JKHA-2005702-Cyanoptyche_gloeocystis
>DXOU-2036490-Nothoceros_aenigmaticus
>ZFRE-2019689-Phaeoceros_carolinianus-sporophyte_1575
>PQED-2054148-Gloeochaete_wittrockiana
>PIVW-2004036-Ceratopteris_thalictroides
>JVSZ-2122133-Equisetum_hymale
>ENQF-2007944-Lycopodium_annotinum
>RDOO-2000122-Racomitrium_varium
>BGXB-2008746-Plagiomnium_insigne
>DXOU-2000995-Nothoceros_aenigmaticus
>PIVW-2020998-Ceratopteris_thalictroides
>UYFR-2005123-Symphyocladia_latiuscula
>UPMJ-2005050-Pseudolycopodiella_caroliniana
>FFPD-2013098-Ceratodon_purpureus
>PIVW-2004304-Ceratopteris_thalictroides
>DXOU-2004607-Nothoceros_aenigmaticus
>PIVW-2000174-Ceratopteris_thalictroides
>ENQF-2004632-Lycopodium_annotinum
>JVSZ-2127811-Equisetum_hymale
>UYFR-2005122-Symphyocladia_latiuscula
>UPMJ-2018444-Pseudolycopodiella_caroliniana
>JVSZ-2014698-Equisetum_hymale
>JVSZ-2127731-Equisetum_hymale
>PIVW-2000450-Ceratopteris_thalictroides
>JVSZ-2000819-Equisetum_hymale
>QMWB-2056566-Anomodon_attenuatus
>AKXB-2062983-Phaeomegaceros_coriaceus
>RDOO-2000811-Racomitrium_varium
>CBAE-2013716-Huperzia_myrisinites
>UPMJ-2005049-Pseudolycopodiella_caroliniana
>CBAE-2013582-Huperzia_myrisinites
>RDOO-2005175-Racomitrium_varium
>PIVW-2008000-Ceratopteris_thalictroides
>UPMJ-2014960-Pseudolycopodiella_caroliniana
>BGXB-2078030-Plagiomnium_insigne
>PIVW-2004305-Ceratopteris_thalictroides
>AKXB-2008754-Phaeomegaceros_coriaceus
>BNCU-2018766-Radula_lindenbergia
>CBAE-2058798-Huperzia_myrisinites
>BNCU-2011264-Radula_lindenbergia
>HVBQ-2129775-Tetraphis_pellucida
>JKHA-2002021-Cyanoptyche_gloeocystis
>LGOW-2012213-Schistochila_sp.
>PIVW-2004303-Ceratopteris_thalictroides
>PIVW-2001491-Ceratopteris_thalictroides
>ENQF-2004631-Lycopodium_annotinum
>UPMJ-2007916-Pseudolycopodiella_caroliniana
>PIVW-2001490-Ceratopteris_thalictroides
>JPYU-2008394-Marchantia_polymorpha
>FFPD-2013100-Ceratodon_purpureus
>BNCU-2016430-Radula_lindenbergia
>UPMJ-2002417-Pseudolycopodiella_caroliniana
>PIVW-2002392-Ceratopteris_thalictroides
>CBAE-2009937-Huperzia_myrisinites
>UPMJ-2011561-Pseudolycopodiella_caroliniana
>JVSZ-2000820-Equisetum_hymale
>JVSZ-2008957-Equisetum_hymale
>JVSZ-2014398-Equisetum_hymale
>PIVW-2019502-Ceratopteris_thalictroides
>ENQF-2014908-Lycopodium_annotinum
>FFPD-2057947-Ceratodon_purpureus
>Chlre_v5_5-Cre11_g467577_t1_1-Chlamydomonas_reinhardtii
>BGXB-2008691-Plagiomnium_insigne
>CVEG-2033633-Azolla_cf._caroliniana
>BGXB-2010078-Plagiomnium_insigne
>ENQF-2015023-Lycopodium_annotinum
>NRWZ-2002384-Metzgeria_crassipilis
>BNCU-2085227-Radula_lindenbergia
>PIVW-2004035-Ceratopteris_thalictroides
>LGOW-2012215-Schistochila_sp.
>UPMJ-2008884-Pseudolycopodiella_caroliniana
>CBAE-2058098-Huperzia_myrisinites
>PIVW-2003993-Ceratopteris_thalictroides
>JPYU-2004301-Marchantia_polymorpha
>JKHA-2005703-Cyanoptyche_gloeocystis
>PIVW-2003992-Ceratopteris_thalictroides
>PIVW-2009413-Ceratopteris_thalictroides
>CBAE-2012867-Huperzia_myrisinites
>ENQF-2081710-Lycopodium_annotinum
>ZFRE-2001131-Phaeoceros_carolinianus-sporophyte_1575
>PIVW-2000173-Ceratopteris_thalictroides
>JVSZ-2014659-Equisetum_hymale
>HVBQ-2127364-Tetraphis_pellucida
>UPMJ-2017788-Pseudolycopodiella_caroliniana
>PVGP-2004381-Porphyridium_purpureum
>CBAE-2051190-Huperzia_myrisinites
>PIVW-2004302-Ceratopteris_thalictroides
>PIVW-2009414-Ceratopteris_thalictroides
>FFPD-2013097-Ceratodon_purpureus
>PIVW-2007999-Ceratopteris_thalictroides
>NRWZ-2014421-Metzgeria_crassipilis
>PIVW-2001493-Ceratopteris_thalictroides
>LGOW-2005840-Schistochila_sp.
>RDOO-2015055-Racomitrium_varium
>CVEG-2123825-Azolla_cf._caroliniana
>BNCU-2082902-Radula_lindenbergia
>PVGP-2020165-Porphyridium_purpureum
>JVSZ-2116154-Equisetum_hymale
>CVEG-2010487-Azolla_cf._caroliniana
>BGXB-2008745-Plagiomnium_insigne
In [13]:
# for the selection workflow
inseq="$inseq"_selection-basal-v3_guide-v4
echo $inseq
MIKC_orthogroup_selection-basal-v3_guide-v4
In [14]:
head data/$inseq.fasta
>tr|Q5KU24|Q5KU24_COLSC MADS-box transcription factor CsMADS1 OS=Coleochaete scutata OX=3125 GN=csmads1 PE=2 SV=1
MGRGKIEIRRIENATSRQVTFSKRRNGLLKKAYELSVLCDVDIAVIVFSPTGKLFQYASS
SMKEILERYEQVPPEQKEKGSQRLDNMDYLNREVAKLRNEVEHKYHEARQLEGEDLDRLG
VYELEQLEQKLSNSMRRIRGRKDELMKAELEGLRKQVADMETALVGAASFDGRPLSGSSN
YLLQSIPGIRTMPPSSLGGMNPASTSLQLGSDRLFGNRGVELHDRSASDESPVMTNRMSV
DFAQAPREMSGVDLSGSPVPPWKSQAAAAAQQEWKNQASSPTDWKVTNTEHLDSWPKAPA
PTPEWKSTSVQPEWKNQSSPSSEWKPLDWMYHGPQD
>tr|Q5KTX4|Q5KTX4_9VIRI MADS-box protein CpMADS1 (Fragment) OS=Closterium peracerosum-strigosum-littorale complex OX=34146 GN=CpMADS1 PE=2 SV=1
MGRGKIEIRKIDNATTRQVTFSKRRNGLLKKAYELAVLCDVEIGVIIFSATGKLFQYAST
NMDSIVERYRRLALETGKDPRPPWQQQNPPQSTGLGAQHGQHNKHGKEKPGQLQARTQQQ
In [15]:
grep '>' -c data/$inseq.fasta
425

2. Aligning

Now we have our fasta file with a feasible amount of sequences. Next step is aligning these. While this may seem trivial, the method of aligning can actually influence your end results quite a bit. Roughly speaking there is several alignment algorithms:

  1. progressive
    • mafft
    • clustal?
  2. pair-wise
    • mafft
    • ...
  3. ...

Especially for bigger datasets, I prefer mafft for it is simply very fast and gave me good results in the past. But by all means try more ways if you get odd results.

If you have a good idea of what you're doing, and you want to run multiple alignments in a loop and go have lunch, have a look at section 2.2.

2.1 running alignments (one by one).

2.1.1. MAFFT online

If you find mafft options and parameters confusing, and/or you have difficulty making alignments, then you may tre to use the online service here. The online MAFFT service does a good job at explaining the parameters and has a nice visualisation as well! So read the webpage, and choose your options and parameters aided by the explanations in the webpage. When you submit your job, the mafft command issued in the background is actually shown to you! Hence you can copy paste that command here if you'd like. That's especially useful when the server is under high load, in this notebook you may choose to use all threads available on your computer --threads $(nproc).

Using the MSAviewer that you can open after running your alignment on this server, you can even interactivelly trim. From a reproducibility/scaling point of view, this is not ideal, but to get a feeling for what you are doing, it is very usefull. Just make sure you keep a record of what you do, and keep intermediate results with clear names.

2.1.2 MAFFT local

I'll start with showing you my go-to approach. First, have a look at the manual.

Next I'll make a directory to store the untrimmed (hence raw) alignments and run the alignment on all available CPU cores.

I like to do this in 'if loops' to prevent re-doing things unnecessarily.

In [16]:
# If jupyter cannot find mafft, but it is installed via conda, try this pragmatic fix:
conda activate phylogenetics
(phylogenetics) 

e-insi

choosing einsi method now, for it is used with multiple conserved domains where linsi focusses on a single conserved domain.

In [17]:
if    [ ! -d ./data/alignments_raw/ ]
then  mkdir  ./data/alignments_raw
fi
if    [ ! -f "./data/alignments_raw/$inseq"_aligned-mafft-linsi.fasta ]
then  einsi --thread $(nproc) data/$inseq.fasta > ./data/alignments_raw/"$inseq"_aligned-mafft-einsi.fasta
fi
(phylogenetics) outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.458
alg=N, model=BLOSUM62, 2.00, -0.00, -0.00, noshift, amax=0.0
8 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
  420 / 425
done.

Progressive alignment ... 
STEP   349 /424 (thread    1) 
Reallocating (by thread 6) ..done. *alloclen = 3153
STEP   424 /424 (thread    6) 
done.
tbfast (aa) Version 7.458
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
8 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 8
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

  420 / 425
Segment   1/  1    1-2650
008-0846-0 (thread    5) worse         
Converged2.
done
dvtditr (aa) Version 7.458
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
8 thread(s)


Strategy:
 E-INS-i (Suitable for sequences with long unalignable regions, very slow)
 Iterative refinement method (<16) with LOCAL pairwise alignment with generalized affine gap costs (Altschul 1998)

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.

Parameters for the E-INS-i option have been changed in version 7.243 (2015 Jun).
To switch to the old parameters, use --oldgenafpair, instead of --genafpair.

(phylogenetics) 

In [18]:
ls ./data/alignments_raw
MIKC_orthogroup_selection-algae_guidev3_aligned-mafft-ginsi.fasta
MIKC_orthogroup_selection-algae_guidev3_aligned-mafft-linsi-dash.fasta
MIKC_orthogroup_selection-algae_guidev3_aligned-mafft-linsi.fasta
MIKC_orthogroup_selection-basal-v1_guide-v2_aligned-mafft-linsi.fasta
MIKC_orthogroup_selection-basal-v2_guide-v3_aligned-mafft-auto.fasta
MIKC_orthogroup_selection-basal-v2_guide-v3_aligned-mafft-linsi.fasta
MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi.fasta
MIKC_orthogroup_selection-v1_guide-v1_aligned-mafft-linsi.fasta
MIKC_orthogroup_selection-v2_guide-v2_aligned-mafft-linsi.fasta
(phylogenetics) 

In [20]:
head ./data/alignments_raw/"$inseq"_aligned-mafft-einsi.fasta
>tr|Q5KU24|Q5KU24_COLSC MADS-box transcription factor CsMADS1 OS=Coleochaete scutata OX=3125 GN=csmads1 PE=2 SV=1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------M----------------------------------------
(phylogenetics) 

3. Alignment trimming

First I remove collumns with hardly any info.

In [21]:
if    [ ! -d data/alignments_trimmed ]
then  mkdir  data/alignments_trimmed 
fi

# define appendix only once here:
trimappendix='trim-gt1'


for a in "data/alignments_raw/$inseq"_aligned*.fasta
do  appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
    if    [ ! -f data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta ]
    then  echo "trimming alignment $a"
          sed -i 's/ /_/g' $a
          trimal -in $a   \
                 -out data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta \
                 -gt .1 \
                 -htmlout data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".html
    fi
done
(phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) trimming alignment data/alignments_raw/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi.fasta
(phylogenetics) 

image.png

That already gives me the M, I and K domains. Now I'll just manually remove everything that doesn't have a K domain or looks suspicious too me.

In [23]:
ls data/alignments_trimmed/$inseq*.fasta
data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1.fasta
data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual.fasta
(phylogenetics) 

Now This is how it looks trimmed:

image.png

Removed 47 sequences by hand (in seqview) for they lacked major parts of the MADS DNA binding domain.

4. Fast tree building

Here we'll make fast trees: not acurate, no bootstraps, but fast. This gives us an idea of the output and how we will process it. Building 'propper' trees can take days sometimes weeks, so it's better to be sure you have all sequences in there you want before you start.

I use two ways to make thise fast trees, first with a program called fasttree and second with the programm iqtree with the -fast parameter. My gut feeling is that the latter is a bit more acurate but takes a couple of minutes. Fasttree takes seconds.

I arbitrarily consider trees to be analyses and not data, hence I store these in the analyses directory.

Since these trees run fast (just take a second to consider how rediculous that sounds) I propose to run these in loops again, taking all the trimmed alignments that were made earlier. The trees run in parallel on one CPU. If you're running many trees (way more than you have computing cores) then don't run these in the background. Practically, that means removing the & character almost at the end of the loop.

4.1 fasttree

In [ ]:
for a in data/alignments_trimmed/"$inseq"_aligned*.fasta
do  echo "making a fasttree of file $a"
    appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
    echo $appendix
    
    if   [ ! -d   analyses/"$inseq"_fasttrees/"$appendix" ]
    then mkdir -p analyses/"$inseq"_fasttrees/"$appendix"
    fi
    
    prefix=analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_fasttree
    if   [ ! -f "$prefix".tree ]
    then nice fasttree -log "$prefix".log \
                       $a \
                       >  "$prefix".tree \
                       2> "$prefix".stderr &
    fi
done
wait
In [ ]:
tail analyses/"$inseq"_fasttrees/*/*fasttree.stderr
In [ ]:
tree analyses/"$inseq"_fasttrees/

4.2 IQtree -fast

And here is the same but for running iqtree. I picked some random model here, but substitute it by anything you like better or have good experience with it the past.

In [ ]:
for a in data/alignments_trimmed/"$inseq"_aligned*.fasta
do  echo "making a iqtree fast tree of file $a"
    appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
    echo $appendix
    if   [ ! -d   analyses/"$inseq"_fasttrees/"$appendix" ]
    then mkdir -p analyses/"$inseq"_fasttrees/"$appendix"
    fi
    
    iqprefix=analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_iqtree-fast
    if   [ ! -f "$iqprefix".iqtree ]
    then nice iqtree -s $a -fast \
                     -m 'LG+R7' \
                     -pre "$iqprefix" \
                     > "$iqprefix".stdout \
                     2> "$iqprefix".stderr &
    fi
done
wait

4.3 Visualise your fast trees.

To visualise your trees, you perhaps already have something installed like mega, seaview, etc. Otherwise you can upload the tree file to iToL (my prefered method) or any other website that visualises trees. See section 6 for uploading your trees to iToL.

Alternativelly, we can try to get a quick snapshot here in the notebook:

4.3.1 Tree comparisons

You may want to run a quick comparison table like so:

In [ ]:
ete3 compare --unrooted -t ./analyses/MYBs-nina_fasttrees/*/*fasttree.tree -r ./analyses/MYBs-nina_fasttrees/*/*iqtree-fast.treefile

Collumns abbreviations:

  • source target tree used
  • ref reference tree used to compare with
  • eff.size Effective tree size used for comparisons (after pruning not shared items)
  • nRF Normalized Robinson-Foulds distance (RF/maxRF)
  • RF Robinson-Foulds symmetric distance
  • maxRF maximum Robinson-Foulds value for this comparison
  • %src_br frequency of edges in target tree found in the reference (1.00 = 100% of branches are found)
  • %ref_br frequency of edges in the reference tree found in target (1.00 = 100% of branches are found)
  • subtrees Number of subtrees used for the comparison (applies only when duplicated items are use to decomposed target trees)
  • treekoD Average distance among all possible subtrees in the original target trees to the reference tree (TreeKO speciation distance).

Big numbers or fractions are usually not a good sign, but can happen. Just be extra carefull before drawing any conclusions.

5. Building trees with IQtree

Finally, we're at the stage to build propper maximum likelyhood phylogenetic trees! Based on your previous results, you should have one or two trimmed alignments you want to make a tree of. There is several choices to make still: a model of evolution and a bootstrapping method.

modelfinder

IQtree is a state-of-the art tree buildling program, which has a model finder algorithm included! This can take a couple of hours, so be sure to do this only once. There is two model finder options, a quick one with some often used models: -m TEST or an extended modelfinder, using more models of evolution and substitution: -m MFP. I recommend the latter. Once you have your best-fit model (for example: 'LG+R7') then use this model when you build more trees from the same alignment: -m 'LG+R7'

bootstrapping

Normal or 'non-parametric' bootstrapping can take quite a long time; I have had trees running for weeks. Hence there is alternatives that are a lot faster but might over or underestimate the bootstrap values if your alignment doesn't fit your model well. To use 'normal bootstraps' the minimum is 100. That's why I like to to 200 to be safe, by adding the option -b 200.

Alternativelly, there is the 'ultrafast bootstrap' option in IQtree. The minumum for this is 1000 bootstraps, so I'd like to do double by including the parameter: -bb 2000. Additionally, I highly recommend also running the approximate likelyhood ratio test for 2000 bootstraps at the same time by including parameters -alrt 2000. This adds a minimal amount of run time and makes interpretation of your tree a lot more reliable.

As the IQtree FAQ says: typically you start believing a clade when the ultra fast bootstraps => 95 and alrt => 80. Interpretation of these values is not linear like 'normal' bootstrap, hence if you lower the threshold of ultrafast bootstraps to 90, you will likely enormously overestimate your results.

other command-line options

In the commandline I wrote below, I instruct iqtree to use no more CPU cores than your computer has, but also to find the optimum amount of cores (more is not always better). Second, a prefix is defined to store the different trees that IQtree wil make.

More info

running IQtree

Now these are all trimmed alignments you have available. Choose one to start with (based on your fasttrees or inspections of your alignments).

Make sure that

  1. the path to this alignment is the variable $a
  2. you choose an appendix based on your iqtree settings
In [3]:
inseq=MIKC_orthogroup_selection-basal-v3_guide-v4
ls data/alignments_trimmed/"$inseq"_aligned*fasta
data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1.fasta
data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual.fasta
In [5]:
conda activate phylogenetics
(phylogenetics) 

In [15]:
a=data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual.fasta

#iqpendix='iqtree-b200'
iqpendix='iqtree-MFP-bb2000-alrt2000'

echo "making a tree of file $a"
echo "The first lines of alignment $a look like this"
head $a

file_appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
echo "Making a directory $file_appendix to store trees (name based on alignment filename)"

if   [ ! -d    analyses/"$inseq"_trees/"$file_appendix" ]
then mkdir -p  analyses/"$inseq"_trees/"$file_appendix" 
fi

iqprefix=analyses/"$inseq"_trees/"$file_appendix"/"$inseq"_"$file_appendix"_"$iqpendix"
if   [ ! -f analyses/"$inseq"_trees/"$file_appendix"/"$inseq"_"$file_appendix"_iqtree-bb1000.tree ]
then nice iqtree -s $a \
                 -m 'MFP' \
                 -bb 2000 \
                 -alrt 2000 \
                 -nt AUTO \
                 -ntmax $(nproc)  \
                 -pre  "$iqprefix" \
                 2>   "$iqprefix".stderr \
                 >    "$iqprefix".stdout && cat "$iqprefix".log | mail -s IQtree_run laura &
fi
[1]+  Done                    nice iqtree -s $a -m 'MFP' -bb 2000 -alrt 2000 -nt AUTO -ntmax $(nproc) -pre "$iqprefix" 2> "$iqprefix".stderr > "$iqprefix".stdout && cat "$iqprefix".log | mail -s IQtree_run laura
(phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) making a tree of file data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual.fasta
(phylogenetics) The first lines of alignment data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual.fasta look like this
(phylogenetics) >tr|Q5KU24|Q5KU24_COLSC_MADS-box_transcription_factor_CsMADS1_OS=Coleochaete_scutata_OX=3125_GN=csmads1_PE=2_SV=1
------MGRGKIEIRRIENATSRQVTFSKRRNGLLKKAYELSVLCDVDIAVIVFSPTGKL
FQYASS-SMKEILERYEQVPPEQKEK-----GSQRLDNM-DYLNREVAKLRNEVEHKYHE
ARQLEGEDLDRLGVYELEQLEQKLSNSMRRIRGRKDELMKAELEGLRKQVADMETASNYL
LQSIPGIRTMPSLGG-------------------------------NPA-----------
----------------------STSLQLGSDRLFGNRGSQAAAAAQQEWKN--ASSPTDW
KTEDSWPWVQPEWKNQ
>tr|Q5KTX4|Q5KTX4_9VIRI_MADS-box_protein_CpMADS1_(Fragment)_OS=Closterium_peracerosum-strigosum-littorale_complex_OX=34146_GN=CpMADS1_PE=2_SV=1
------MGRGKIEIRKIDNATTRQVTFSKRRNGLLKKAYELAVLCDVEIGVIIFSATGKL
FQYAST-NMDSIVERYRRLALRPPWQ-----QQNPPQSTPGQLQARTQQQRQQLSRLQAE
(phylogenetics) (phylogenetics) (phylogenetics) Making a directory aligned-mafft-einsi_trim-gt1-seqrmmanual to store trees (name based on alignment filename)
(phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) (phylogenetics) [1] 6283
(phylogenetics) 

You can have a look at the last lines of your log file like this:

In [23]:
tail -n 30 $iqprefix.log
Alternative NNI shows better log-likelihood -55313.534 > -55313.535
0.867 sec.
Creating bootstrap support values...
Split supports printed to NEXUS file analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.splits.nex
Total tree length: 153.138

Total number of iterations: 286
CPU time used for tree search: 6024.740 sec (1h:40m:24s)
Wall-clock time used for tree search: 1035.498 sec (0h:17m:15s)
Total CPU time used: 6265.956 sec (1h:44m:25s)
Total wall-clock time used: 1096.807 sec (0h:18m:16s)

Computing bootstrap consensus tree...
Reading input file analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.splits.nex...
250 taxa and 1727 splits.
Consensus tree written to analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.contree
Reading input trees file analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.contree
Log-likelihood of consensus tree: -55314.508

Analysis results written to: 
  IQ-TREE report:                analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.iqtree
  Maximum-likelihood tree:       analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.treefile
  Likelihood distances:          analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.mldist

Ultrafast bootstrap approximation results written to:
  Split support values:          analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.splits.nex
  Consensus tree:                analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.contree
  Screen log file:               analyses/MIKC_orthogroup_selection-basal-v3_guide-v4_trees/aligned-mafft-einsi_trim-gt1-seqrmmanual/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual_iqtree-MFP-bb2000-alrt2000.log

Date and Time: Sun Jun 28 18:14:53 2020
[1]+  Done                    nice iqtree -s $a -m 'MFP' -bb 2000 -alrt 2000 -nt AUTO -ntmax $(nproc) -pre "$iqprefix" 2> "$iqprefix".stderr > "$iqprefix".stdout && cat "$iqprefix".log | mail -s IQtree_run laura
(phylogenetics) 

In [70]:
grep 'START BOOTSTRAP'  $iqprefix.log -c 
grep 'START BOOTSTRAP'  $iqprefix.log
15
(phylogenetics) ===> START BOOTSTRAP REPLICATE NUMBER 1
===> START BOOTSTRAP REPLICATE NUMBER 2
===> START BOOTSTRAP REPLICATE NUMBER 3
===> START BOOTSTRAP REPLICATE NUMBER 4
===> START BOOTSTRAP REPLICATE NUMBER 5
===> START BOOTSTRAP REPLICATE NUMBER 6
===> START BOOTSTRAP REPLICATE NUMBER 7
===> START BOOTSTRAP REPLICATE NUMBER 8
===> START BOOTSTRAP REPLICATE NUMBER 9
===> START BOOTSTRAP REPLICATE NUMBER 10
===> START BOOTSTRAP REPLICATE NUMBER 11
===> START BOOTSTRAP REPLICATE NUMBER 12
===> START BOOTSTRAP REPLICATE NUMBER 13
===> START BOOTSTRAP REPLICATE NUMBER 14
===> START BOOTSTRAP REPLICATE NUMBER 15
(phylogenetics) 

In [78]:
grep 'Total CPU time used'  $iqprefix.log -c 
grep 'Total CPU time used'  $iqprefix.log
14
(phylogenetics) Total CPU time used: 13791.055 sec (3h:49m:51s)
Total CPU time used: 3679.373 sec (1h:1m:19s)
Total CPU time used: 6833.649 sec (1h:53m:53s)
Total CPU time used: 4335.747 sec (1h:12m:15s)
Total CPU time used: 8584.484 sec (2h:23m:4s)
Total CPU time used: 6662.962 sec (1h:51m:2s)
Total CPU time used: 6010.418 sec (1h:40m:10s)
Total CPU time used: 8221.260 sec (2h:17m:1s)
Total CPU time used: 6261.218 sec (1h:44m:21s)
Total CPU time used: 5998.986 sec (1h:39m:58s)
Total CPU time used: 8667.524 sec (2h:24m:27s)
Total CPU time used: 9995.875 sec (2h:46m:35s)
Total CPU time used: 4633.229 sec (1h:17m:13s)
Total CPU time used: 9898.357 sec (2h:44m:58s)
(phylogenetics) 

Are you content with your tree? Great news! If you want to do another run, I recommend copying the cell above and editing the copy. That way you keep the code for all trees you made. Don't forget to explain what you observed, why you're making a new tree, and what you're changing (remember this is your labjournal).

tree storage

For tree storage and sharing, I have yet to encounter a better tool than EMBLs iToL. It's a great interface for exploring and sharing trees with colleagues. You can browse to the treefile IQtree created on your computer and upload it to iToL. Alternativelly, you can copy paste the contents of the file to iToL. Make sure to keep the original filename as well! This file name now contains a brief summary of how this tree was made.

conclusions

The tree of the alignment I send before (basal-v3) finished running now. It is online here: https://itol.embl.de/tree/13121159166346421593419936

The new algal outgroup nicely clusters together except for a few spurious oddities. Inside the algal outgroup, there is arabidopsis AGL33, which seems like MADSbox only protein in my alignment. I haven't been too strict on the guide sequences, but this one will have to go out if we're only comparing MIKC genes. Working our way through the tree starting at the algae clade, there is a big polyphyletic clade containing few Azolla sequences. This clade contains MpMADS1, AGL 30 65 66 67 94 104, and several Selaginella MADS sequences: 2 4 and 10. Then come the homeotic B genes in a clade containing Pi AP3 SVP. Then finally, there is the clade of flowering genes (now containing FLC) and close to that our sequence of interest in the CRM7 clade. There is a paper copy on your desk if you'd like on which I scrabbled a bit, but other than that it is no different than the online version.

There is now two new trees running. First I started a 100 normal bootstrap tree on an alignment like this, however, I removed all Azfiv2 sequences, and Azfiv1 and Salvinia sequences which had no clear K domain. This should be finished sometime tomorrow evening. Normal bootstraping is feasible for an alignment like this, and it's just a lot safer than the fancy methods which may overestimate support values. Especially when sequences have diverged a lot. Hopefully, the difference is marginal. In any case, the support estimations will be more accurate.

Second, I think it is worth trying a stricter trimming on this. Topology wise, we have reasonable monophyletic clades on the macro level. Yet these clades don't very nicely reflect speciations patterns one would expect. I suspect that gappy regions may mess up the speciation patterns and artificially add to branch lengths. I'll trim with gap threshold 50% and run a tree on UFbootstraps+SHALRT so we should now tomorrow morning.

Some other annoyances: The only gymnosperm, Picea abies, actually behaves oddly at times. Ideally, I would remove it and replace it by a few others, but who has the time and will it add that much... Perhaps it is worth adding just one other gymnosperm, to see if they cluster together. I have two angiosperms, rice and arabidopsis. Even in the odd places, they cluster together which gives me some extra confidence they are placed correctly. The algal root perhaps is a bit overkill, we could do with a couple less.